
Added writer.wait_merging_threads call #628

Merged · jamesbraza merged 1 commit from adding-wait-merging-threads into main on Oct 22, 2024
Conversation

jamesbraza (Collaborator)

jamesbraza added the enhancement (New feature or request) label on Oct 22, 2024
jamesbraza self-assigned this on Oct 22, 2024
dosubot added the size:XS (This PR changes 0-9 lines, ignoring generated files) label on Oct 22, 2024
mskarlin (Collaborator) left a comment


hopefully this doesn't hurt performance

dosubot added the lgtm (This PR has been approved by a maintainer) label on Oct 22, 2024
jamesbraza merged commit bea9de0 into main on Oct 22, 2024 (5 checks passed)
jamesbraza deleted the adding-wait-merging-threads branch on October 22, 2024 at 23:47
cjrh commented Oct 24, 2024

I hope it's OK that I'm stopping by; I just wanted to do a quick check on the merging_threads addition.

> hopefully this doesn't hurt performance

In your code it will definitely hurt async concurrency, but probably not data throughput for indexing. I don't know whether your application is sensitive to the concurrency impact or not. Handwaving some numbers: if your application usually serves, say, 10k concurrent connections, a little blocking code like these writer interactions could drop that to, say, 1k if it were in a hot path. It probably isn't, but with those made-up numbers, that's the kind of impact blocking code has inside an async event loop. (On the other hand, if you have very few concurrent IO-bound tasks/users, then it probably doesn't matter.)
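
To make that concrete, here is a minimal, self-contained sketch (mine, just for illustration, not from this repo) of how one synchronous call freezes every other coroutine on the loop:

import asyncio
import time

async def heartbeat() -> None:
    # Simulates other concurrent work; ticks while the event loop is free.
    while True:
        print("tick")
        await asyncio.sleep(0.1)

async def blocking_write() -> None:
    # Stand-in for writer.commit() / writer.wait_merging_threads():
    # a synchronous call that holds the event loop the whole time.
    time.sleep(2)

async def main() -> None:
    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0.5)  # ~5 ticks print here
    await blocking_write()    # no ticks for ~2 s: the whole loop is stalled
    await asyncio.sleep(0.5)  # ticks resume
    hb.cancel()

asyncio.run(main())

Running it, the ticks stop dead for the full two seconds around blocking_write; that pause is what every other request on the loop experiences.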

wait_merging_threads makes it worse, but you are already doing blocking operations with the writer, so this is an existing issue.

        async def _add_document() -> None:
            if not await self.filecheck(index_doc["file_location"], index_doc["body"]):
                try:
                    # All three writer calls below are synchronous and touch disk,
                    # so they block the event loop while they run.
                    writer: IndexWriter = (await self.index).writer()
                    writer.add_document(Document.from_dict(index_doc))  # type: ignore[call-arg]
                    writer.commit()
                    writer.wait_merging_threads()

                    filehash = self.filehash(index_doc["body"])
                    (await self.index_files)[index_doc["file_location"]] = filehash
                    if document:
                        docs_index_dir = await self.docs_index_directory
                        async with await anyio.open_file(
                            docs_index_dir / f"{filehash}.{self.storage.extension()}",
                            "wb",
                        ) as f:
                            await f.write(self.storage.write_to_string(document))
                    self.changed = True
                except ValueError as e:
                    if "Failed to acquire Lockfile: LockBusy." in str(e):
                        raise AsyncRetryError("Failed to acquire lock.") from e
                    raise

The issue is that all interactions with the writer that touch disk are blocking, and wait_merging_threads especially so. It doesn't normally take long, but even hundreds of milliseconds of blocking will tank your asyncio concurrency. To resolve this, you would have to perform the writer operations inside a thread or a subprocess. I had a quick look at the tantivy-py source and I don't think the writer operations release the GIL, which means you'll probably need a subprocess. This can be set up with run_in_executor.

It might look something like this (pseudocode):

from tantivy import Document, Index, IndexWriter

def writer_stuff(index_path: str, index_docs: list[dict]) -> None:
    # Runs in a separate process, so the blocking writer calls below
    # can't stall the caller's event loop.
    writer: IndexWriter = Index.open(index_path).writer(heap_size=1000 * 1000 * 1000)
    for index_doc in index_docs:
        writer.add_document(Document.from_dict(index_doc))  # type: ignore[call-arg]
    writer.commit()
    writer.wait_merging_threads()

And then at the call site

import asyncio
import concurrent.futures

async def _add():
    ...
    index_docs = ...
    # A ProcessPoolExecutor sidesteps the GIL; ideally create it once and reuse it.
    exe = concurrent.futures.ProcessPoolExecutor()
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(exe, writer_stuff, index_path, index_docs)
    ...

This kind of code is annoying to write, because any parameters that cross the interprocess boundary do so via pickle; that's why I pass index_path instead of an Index instance. It is what it is. Index may already be pickleable, I haven't checked.
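
If you want to check pickleability quickly, a throwaway script like this works (the schema here is just a stand-in; any Index instance would do):

import pickle

import tantivy

# Build a minimal in-memory index purely to test pickling.
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)

try:
    pickle.dumps(index)
    print("Index is pickleable")
except (TypeError, pickle.PicklingError) as e:
    print(f"Index is not pickleable: {e}")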

cjrh commented Oct 24, 2024

(fyi @mskarlin @jamesbraza )

jamesbraza (Collaborator, Author) commented Oct 25, 2024

Hello @cjrh, sorry, some things on our end delayed my response a bit. Thanks for your thoughts and time here, it's much appreciated. This PR actually turns out to be a massive win:

  • Index file count: ~1000X smaller (21,004 to 22 files)
  • Index size: 40% smaller (650 to 389 MiB)
  • Index build time: ~2X longer (anecdotally)
  • (Runtime) tantivy.Searcher.search results: unchanged
  • (Runtime) tantivy.Searcher.search speed: noticeably faster (though not quantified)

So it turns out that by not calling wait_merging_threads, we were missing out on a lot of compaction.

I might suggest to tantivy-py that the IndexWriter interface be tweaked to make this more seamless, so other people don't miss out on the same gain; see the sketch below for one shape that could take.
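
For illustration, a user-side wrapper could pair commit with wait_merging_threads automatically. This is just a sketch of the idea, not anything that exists in tantivy-py today:

from contextlib import contextmanager
from typing import Iterator

from tantivy import Index, IndexWriter

@contextmanager
def committing_writer(index: Index, **writer_kwargs) -> Iterator[IndexWriter]:
    # Hypothetical helper: yields a writer, then commits and waits for
    # merge threads on clean exit, so the pairing can't be forgotten.
    writer = index.writer(**writer_kwargs)
    try:
        yield writer
    except Exception:
        writer.rollback()
        raise
    else:
        writer.commit()
        writer.wait_merging_threads()

# Usage:
# with committing_writer(index) as writer:
#     writer.add_document(doc)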


Per your comment on multiprocessing and the GIL: that is a great suggestion. I opened #646 to make sure we track this possibility.

cjrh commented Oct 25, 2024

good to hear 👍🏼

22 files sounds much more like what I was expecting. You should no longer be seeing the memory error.
